1 Step 1: Define the Problem

Requirement:

  1. Download the data, and load it in Pycharm and provide initial overview information.

  2. Visualize the location of the car accidents.

  3. Find out the insight from the dataset (i.e. Location/ Time of Day).

  4. Take Weather Data into Consideration.

  5. Find out the potential car accident area given the current car location.

This time, I would leverage the power of R and Python to perform the analysis and present the result via both Rmarkdown (R) and jupiter notebook (python). The analysis would be based on a standard data science framework and answer the questions above; however, I would extend the scope of the analysis to identify any unique insight as well as provide detailed explanation of my code.

2 Step 3: Preprocess the Data

2.1 Dependencies

2.1.1 Required libraries

if (!require("pacman")) install.packages("pacman") 
pacman::p_load(tidyverse, DT, lubridate, leaflet, leaflet.extras, maps, data.table, ggthemes, rebus, clue, skimr, plotly)

2.1.2 Required Dataset

# Initially use read.csv then write the file so that going forward I can use fread
data <- read.csv("input/NYPD_Motor_Vehicle_Collisions.csv", stringsAsFactors = F)

2.2 First Glimpse

The first question can be answered by looking at the structure of the dataset. The dataset has 1089265 observations(rows) and 29 variables(columns).

2.2.1 First 20 rows with selected columns

data %>% 
  head(100) %>% 
  datatable(filter = 'top', options = list(
  pageLength = 15, autoWidth = TRUE
))

2.2.2 Structure

data %>% 
  glimpse()
## Observations: 1,089,265
## Variables: 29
## $ DATE                          <chr> "08/04/2017", "08/04/2017", "08/...
## $ TIME                          <chr> "0:00", "0:00", "0:00", "0:00", ...
## $ BOROUGH                       <chr> "QUEENS", "", "", "", "", "", ""...
## $ ZIP.CODE                      <int> 11436, NA, NA, NA, NA, NA, NA, 1...
## $ LATITUDE                      <dbl> 40.66689, 40.71995, 40.71867, 40...
## $ LONGITUDE                     <dbl> -73.79041, -74.00859, -73.96350,...
## $ LOCATION                      <chr> "(40.666885, -73.790405)", "(40....
## $ ON.STREET.NAME                <chr> "NORTH CONDUIT AVENUE           ...
## $ CROSS.STREET.NAME             <chr> "149 STREET", "", "", "", "", ""...
## $ OFF.STREET.NAME               <chr> "", "", "", "", "", "", "", "", ...
## $ NUMBER.OF.PERSONS.INJURED     <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0,...
## $ NUMBER.OF.PERSONS.KILLED      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NUMBER.OF.PEDESTRIANS.INJURED <int> 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,...
## $ NUMBER.OF.PEDESTRIANS.KILLED  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NUMBER.OF.CYCLIST.INJURED     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NUMBER.OF.CYCLIST.KILLED      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NUMBER.OF.MOTORIST.INJURED    <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ NUMBER.OF.MOTORIST.KILLED     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CONTRIBUTING.FACTOR.VEHICLE.1 <chr> "Unspecified", "Unsafe Lane Chan...
## $ CONTRIBUTING.FACTOR.VEHICLE.2 <chr> "Unspecified", "Unsafe Lane Chan...
## $ CONTRIBUTING.FACTOR.VEHICLE.3 <chr> "", "", "", "", "", "", "", "", ...
## $ CONTRIBUTING.FACTOR.VEHICLE.4 <chr> "", "", "", "", "", "", "", "", ...
## $ CONTRIBUTING.FACTOR.VEHICLE.5 <chr> "", "", "", "", "", "", "", "", ...
## $ UNIQUE.KEY                    <int> 3725017, 3725047, 3725533, 37248...
## $ VEHICLE.TYPE.CODE.1           <chr> "PASSENGER VEHICLE", "PICK-UP TR...
## $ VEHICLE.TYPE.CODE.2           <chr> "PASSENGER VEHICLE", "SPORT UTIL...
## $ VEHICLE.TYPE.CODE.3           <chr> "", "", "", "", "", "", "", "", ...
## $ VEHICLE.TYPE.CODE.4           <chr> "", "", "", "", "", "", "", "", ...
## $ VEHICLE.TYPE.CODE.5           <chr> "", "", "", "", "", "", "", "", ...

2.2.3 Skim

data %>% 
  skim() %>% 
  kable()
## Skim summary statistics  
##  n obs: 1089265    
##  n variables: 29    
## 
## Variable type: character
## 
## variable                        missing   complete   n         min   max   empty     n_unique 
## ------------------------------  --------  ---------  --------  ----  ----  --------  ---------
## BOROUGH                         0         1089265    1089265   0     13    297024    6        
## CONTRIBUTING.FACTOR.VEHICLE.1   0         1089265    1089265   0     53    4591      49       
## CONTRIBUTING.FACTOR.VEHICLE.2   0         1089265    1089265   0     53    141438    49       
## CONTRIBUTING.FACTOR.VEHICLE.3   0         1089265    1089265   0     53    1018239   44       
## CONTRIBUTING.FACTOR.VEHICLE.4   0         1089265    1089265   0     53    1073947   43       
## CONTRIBUTING.FACTOR.VEHICLE.5   0         1089265    1089265   0     43    1085506   33       
## CROSS.STREET.NAME               0         1089265    1089265   0     32    234538    15260    
## DATE                            0         1089265    1089265   10    10    0         1861     
## LOCATION                        0         1089265    1089265   0     25    207066    121618   
## OFF.STREET.NAME                 0         1089265    1089265   0     40    928678    75926    
## ON.STREET.NAME                  0         1089265    1089265   0     32    197137    9497     
## TIME                            0         1089265    1089265   4     5     0         1440     
## VEHICLE.TYPE.CODE.1             0         1089265    1089265   0     30    7384      18       
## VEHICLE.TYPE.CODE.2             0         1089265    1089265   0     30    166863    18       
## VEHICLE.TYPE.CODE.3             0         1089265    1089265   0     30    1020222   18       
## VEHICLE.TYPE.CODE.4             0         1089265    1089265   0     30    1074465   18       
## VEHICLE.TYPE.CODE.5             0         1089265    1089265   0     30    1085615   16       
## 
## Variable type: integer
## 
## variable                        missing   complete   n         mean         sd           p0      p25      p50       p75       p100      hist     
## ------------------------------  --------  ---------  --------  -----------  -----------  ------  -------  --------  --------  --------  ---------
## NUMBER.OF.CYCLIST.INJURED       0         1089265    1089265   0.02         0.14         0       0        0         0         4         <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## NUMBER.OF.CYCLIST.KILLED        0         1089265    1089265   7.8e-05      0.0088       0       0        0         0         1         <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## NUMBER.OF.MOTORIST.INJURED      0         1089265    1089265   0.19         0.63         0       0        0         0         43        <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## NUMBER.OF.MOTORIST.KILLED       0         1089265    1089265   0.00045      0.024        0       0        0         0         5         <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## NUMBER.OF.PEDESTRIANS.INJURED   0         1089265    1089265   0.052        0.24         0       0        0         0         28        <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## NUMBER.OF.PEDESTRIANS.KILLED    0         1089265    1089265   0.00066      0.026        0       0        0         0         2         <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## NUMBER.OF.PERSONS.INJURED       0         1089265    1089265   0.26         0.66         0       0        0         0         43        <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## NUMBER.OF.PERSONS.KILLED        0         1089265    1089265   0.0012       0.036        0       0        0         0         5         <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## UNIQUE.KEY                      0         1089265    1089265   2200694.89   1519164.88   22      274242   3180753   3453075   3726256   <U+2586><U+2581><U+2581><U+2581><U+2581><U+2581><U+2583><U+2587> 
## ZIP.CODE                        297136    792129     1089265   10810.88     565.88       10000   10128    11205     11236     11697     <U+2586><U+2581><U+2583><U+2581><U+2581><U+2587><U+2585><U+2581> 
## 
## Variable type: numeric
## 
## variable    missing   complete   n         mean     sd     p0        p25      p50      p75      p100    hist     
## ----------  --------  ---------  --------  -------  -----  --------  -------  -------  -------  ------  ---------
## LATITUDE    207066    882199     1089265   40.72    0.3    0         40.67    40.72    40.77    41.13   <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587> 
## LONGITUDE   207066    882199     1089265   -73.92   0.99   -201.36   -73.98   -73.93   -73.87   0       <U+2581><U+2581><U+2581><U+2581><U+2581><U+2587><U+2581><U+2581>

2.2.4 Summary

data %>% summary()
##      DATE               TIME             BOROUGH             ZIP.CODE     
##  Length:1089265     Length:1089265     Length:1089265     Min.   :10000   
##  Class :character   Class :character   Class :character   1st Qu.:10128   
##  Mode  :character   Mode  :character   Mode  :character   Median :11205   
##                                                           Mean   :10811   
##                                                           3rd Qu.:11236   
##                                                           Max.   :11697   
##                                                           NA's   :297136  
##     LATITUDE        LONGITUDE         LOCATION         ON.STREET.NAME    
##  Min.   : 0.00    Min.   :-201.36   Length:1089265     Length:1089265    
##  1st Qu.:40.67    1st Qu.: -73.98   Class :character   Class :character  
##  Median :40.72    Median : -73.93   Mode  :character   Mode  :character  
##  Mean   :40.72    Mean   : -73.92                                        
##  3rd Qu.:40.77    3rd Qu.: -73.87                                        
##  Max.   :41.13    Max.   :   0.00                                        
##  NA's   :207066   NA's   :207066                                         
##  CROSS.STREET.NAME  OFF.STREET.NAME    NUMBER.OF.PERSONS.INJURED
##  Length:1089265     Length:1089265     Min.   : 0.0000          
##  Class :character   Class :character   1st Qu.: 0.0000          
##  Mode  :character   Mode  :character   Median : 0.0000          
##                                        Mean   : 0.2556          
##                                        3rd Qu.: 0.0000          
##                                        Max.   :43.0000          
##                                                                 
##  NUMBER.OF.PERSONS.KILLED NUMBER.OF.PEDESTRIANS.INJURED
##  Min.   :0.000000         Min.   : 0.00000             
##  1st Qu.:0.000000         1st Qu.: 0.00000             
##  Median :0.000000         Median : 0.00000             
##  Mean   :0.001198         Mean   : 0.05247             
##  3rd Qu.:0.000000         3rd Qu.: 0.00000             
##  Max.   :5.000000         Max.   :28.00000             
##                                                        
##  NUMBER.OF.PEDESTRIANS.KILLED NUMBER.OF.CYCLIST.INJURED
##  Min.   :0.0000000            Min.   :0.00000          
##  1st Qu.:0.0000000            1st Qu.:0.00000          
##  Median :0.0000000            Median :0.00000          
##  Mean   :0.0006647            Mean   :0.02047          
##  3rd Qu.:0.0000000            3rd Qu.:0.00000          
##  Max.   :2.0000000            Max.   :4.00000          
##                                                        
##  NUMBER.OF.CYCLIST.KILLED NUMBER.OF.MOTORIST.INJURED
##  Min.   :0.0e+00          Min.   : 0.0000           
##  1st Qu.:0.0e+00          1st Qu.: 0.0000           
##  Median :0.0e+00          Median : 0.0000           
##  Mean   :7.8e-05          Mean   : 0.1862           
##  3rd Qu.:0.0e+00          3rd Qu.: 0.0000           
##  Max.   :1.0e+00          Max.   :43.0000           
##                                                     
##  NUMBER.OF.MOTORIST.KILLED CONTRIBUTING.FACTOR.VEHICLE.1
##  Min.   :0.000000          Length:1089265               
##  1st Qu.:0.000000          Class :character             
##  Median :0.000000          Mode  :character             
##  Mean   :0.000454                                       
##  3rd Qu.:0.000000                                       
##  Max.   :5.000000                                       
##                                                         
##  CONTRIBUTING.FACTOR.VEHICLE.2 CONTRIBUTING.FACTOR.VEHICLE.3
##  Length:1089265                Length:1089265               
##  Class :character              Class :character             
##  Mode  :character              Mode  :character             
##                                                             
##                                                             
##                                                             
##                                                             
##  CONTRIBUTING.FACTOR.VEHICLE.4 CONTRIBUTING.FACTOR.VEHICLE.5
##  Length:1089265                Length:1089265               
##  Class :character              Class :character             
##  Mode  :character              Mode  :character             
##                                                             
##                                                             
##                                                             
##                                                             
##    UNIQUE.KEY      VEHICLE.TYPE.CODE.1 VEHICLE.TYPE.CODE.2
##  Min.   :     22   Length:1089265      Length:1089265     
##  1st Qu.: 274242   Class :character    Class :character   
##  Median :3180753   Mode  :character    Mode  :character   
##  Mean   :2200695                                          
##  3rd Qu.:3453075                                          
##  Max.   :3726256                                          
##                                                           
##  VEHICLE.TYPE.CODE.3 VEHICLE.TYPE.CODE.4 VEHICLE.TYPE.CODE.5
##  Length:1089265      Length:1089265      Length:1089265     
##  Class :character    Class :character    Class :character   
##  Mode  :character    Mode  :character    Mode  :character   
##                                                             
##                                                             
##                                                             
## 

2.3 Data Cleaning: Correcting, Completing, Creating, and Converting

2.3.1 Correcting & Completing

As the data range section shows, some data entries for latitude and longitude are out of the scale and need to be corrected or removed.

data <- data %>% filter(LATITUDE>0, LONGITUDE<-72, LONGITUDE>-75)

3 Interactive Map

Looking at the summary result, I got the map below. It is very interesting to see that all the pick up location are outside of the core area of New York City. By doing a little research, I found out that the green taxi are only allowed to pick up passengers (street hails or calls) in outer boroughs (excluding John F. Kennedy International Airport and LaGuardia Airport unless arranged in advance) and in Manhattan above East 96th and West 110th Streets. That explains the pattern we see here.

set.seed(0)
data %>% 
  sample_n(size=5000) %>% 
  
  leaflet() %>% 
  addProviderTiles(providers$HikeBike.HikeBike, group = "color map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
  addCircleMarkers(~LONGITUDE, ~LATITUDE, radius = 1,
                   color = "firebrick", fillOpacity = 0.001) %>%
  # addCircleMarkers(~Dropoff_longitude, ~Dropoff_latitude, radius = 1,
  #                  color = "steelblue", fillOpacity = 0.001, group = 'DropOff') %>%
  addLayersControl(
    baseGroups = c("Color map", "Light map"),
    # overlayGroups = c("PickUp", "DropOff"),
    options = layersControlOptions(collapsed = T)
  ) %>% 
  addSearchOSM() 
# %>% 
#   addReverseSearchGoogle()
#   addSearchFeatures(
#      targetGroups = c("PickUp", "DropOff"))

4 Interactive Map with Clustering

set.seed(0)
data %>% 
  sample_n(size=5000) %>% 
  
  leaflet() %>% 
  addProviderTiles(providers$HikeBike.HikeBike, group = "color map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
  addCircleMarkers(~LONGITUDE, ~LATITUDE, radius = 1,
                   color = "firebrick", fillOpacity = 0.001,
                   clusterOptions = markerClusterOptions()) %>%
  # addCircleMarkers(~Dropoff_longitude, ~Dropoff_latitude, radius = 1,
  #                  color = "steelblue", fillOpacity = 0.001, group = 'DropOff') %>%
  addLayersControl(
    baseGroups = c("Color map", "Light map"),
    # overlayGroups = c("PickUp", "DropOff"),
    options = layersControlOptions(collapsed = T)
  ) %>% 
  addSearchOSM() 

4.0.1 Creating, and Converting

I converted datetime to time series data and created variables such as hour, weekday, weekend, etc.

Hour has value from 1 to 24, denoting 24 hours a day.

Weekday has value from Monday to Friday and is categorized as factor.

Weekend has value Weekday and Weekend.

data <- data %>%
         mutate(dateTime = mdy_hm(paste(DATE, TIME, sep = ' ')),
         weekday=as.factor(weekdays(dateTime)),
         weekend=if_else(weekday=='Saturday'|weekday=='Sunday','Weekend','Weekday'),
         hour = hour(dateTime)+1)

5 Step 4 Perform Exploratory Data Analysis (EDA)

5.1 Visualize Number of Accident by Time of Day

5.1.1 Number of Accident by Time of Day for both Weekday and Weekend

From an initial look at the number of accident by time of day graph, most of the accidents happened during the day with the peak ocurring around hour ending 17~18. The difference between the 8 and 9 is quite significant.

ggplotly(data %>% group_by(hour) %>% summarise(num_accident=n()) %>% 
  ggplot(aes(hour, num_accident, fill = num_accident)) + geom_col() +
  geom_label(aes(label=round(num_accident,1)), size=3.5, alpha=.7) +
  # coord_flip() +
  scale_x_continuous(breaks=seq(1,24,1)) +
  theme_economist() +
  theme(legend.position = 'none') +
  labs(title='Number of Accidents (Weekday and Weekdend)',subtitle='All Data Included (Weekday and Weekdend)',caption="source: Kaggle Open Source Data",
       y="Number of Accidents", x="Time of Day"))

5.1.2 Number of Accident by Time of Day for Weekday

Same as the observation from the full dataset, a slightly higher peak in the morning, which is presumably caused by the rush hours.

ggplotly(data %>% filter(weekend=='Weekday') %>% group_by(hour) %>% summarise(num_accident=n()) %>% 
  ggplot(aes(hour, num_accident, fill = num_accident)) + geom_col() +
  geom_label(aes(label=round(num_accident,1)), size=3.5, alpha=.7) +
  # coord_flip() +
  scale_x_continuous(breaks=seq(1,24,1)) +
  theme_economist() +
  theme(legend.position = 'none') +
  labs(title='Number of Accidents (Weekday)',
       y="Number of Accidents", x="Time of Day"))

5.1.3 Number of Accident by Time of Day for Weekend

For the weekend, the pattern changed and the peak is ocurring between hour ending 15 to 17.

ggplotly(data %>% filter(weekend=='Weekend') %>% group_by(hour) %>% summarise(num_accident=n()) %>% 
  ggplot(aes(hour, num_accident, fill = num_accident)) + geom_col() +
  geom_label(aes(label=round(num_accident,1)), size=3.5, alpha=.7) +
  # coord_flip() +
  scale_x_continuous(breaks=seq(1,24,1)) +
  theme_economist() +
  theme(legend.position = 'none') +
  labs(title='Number of Accidents (Weekend)',
       y="Number of Accidents", x="Time of Day"))

5.1.4 Combined Weekday and Weekend Number of Accidents

ggplotly(data %>%
  group_by(hour, weekend) %>%
  summarise(num_accident=n()) %>%
  ggplot(aes(hour, num_accident, color = weekend)) +
  geom_smooth(method = "loess", span = 1/2, se=F) +
  geom_point(size = 4) +
  labs(x = "Time of Day", y = "Number of Accidents") +
  scale_x_continuous(breaks=seq(1,24,1)) +
  theme_economist() +
  scale_color_discrete("Weekday vs. Weekend"))

5.2 Top 5 Accidents Locations on Weekdays and Weekend

5.2.1 Basic Set up

Rather than directing calculating the top 5 Accidents locations, I preprocessed the data a little bit. The logic is that if I directly use the longitude and latitude data, the same pick up spot with slightly different coordinates would be treated as different pick up locations and that would definitely deviate from the actual result. Therefore, I round the longitude and latitude to the 3 decimals from which the coordinates with slightly different number would be treated as one spot. I also used a green cab icon to denote the accident spots. The graph is interactive and can be zoom in and out. If you place the mouse on the green cab icon, it would show how many accidents at the location based on the dataset.

round_num <- 3

Weekday_Top5 <- data %>% filter(weekend=='Weekday') %>% 
  group_by(lng=round(LONGITUDE,round_num),lat=round(LATITUDE,round_num)) %>% 
  count() %>% arrange(desc(n)) %>% head(5)


Weekend_Top5 <- data %>% filter(weekend=='Weekend') %>% 
  group_by(lng=round(LONGITUDE,round_num),lat=round(LATITUDE,round_num)) %>% 
  count() %>% arrange(desc(n)) %>% head(5)

greentaxi <- makeIcon(
  iconUrl = "https://i.imgur.com/6rw618Q.png",
  iconWidth = 38, iconHeight = 35,
  iconAnchorX = 19, iconAnchorY = 39
)

5.2.2 Weekday Top 5 Pick up locations

There are the top 5 pick up locations during weekdays.

  1. 71st Ave and Queens Blvd. (13,987 pick ups in Feb 2016)

  2. E 125th St and Park Ave. (13,235 pick ups in Feb 2016)

  3. Broad Way and Roosevelt Ave. (12,566 pick ups in Feb 2016)

  4. Madison Ave and E 101st St. (7,198 pick ups in Feb 2016)

  5. Bedford Ave and N 7th St. (6,105 pick ups in Feb 2016)

Weekday_Top5 %>%
  leaflet() %>% 
  addProviderTiles(providers$HikeBike.HikeBike, group = "color map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
  # addProviderTiles(providers$Stamen.Toner, group = "white map") %>% 
  addScaleBar() %>%
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
  addCircleMarkers(~lng, ~lat, radius = 1,
                   color = "firebrick", fillOpacity = 0.001) %>%
  addMarkers(~lng, ~lat, icon = greentaxi, label = ~as.character(paste("Number of Accidents:",Weekday_Top5$n))) %>%
  addLayersControl(
    baseGroups = c("Color map", "Light map"),
    options = layersControlOptions(collapsed = FALSE)
  )

5.2.3 Weekend Top 5 Pick locations

  1. Broad Way and Roosevelt Ave. (6,465 pick ups in Feb 2016)

  2. 71st Ave and Queens Blvd. (5,249 pick ups in Feb 2016)

  3. E 125th St and Park Ave. (4,788 pick ups in Feb 2016)

  4. Wythe Ave and N 11th St. (4,507 pick ups in Feb 2016)

  5. Bedford Ave and N 7th St. (2,768 pick ups in Feb 2016)

Weekend_Top5 %>%
  leaflet() %>% 
  addProviderTiles(providers$HikeBike.HikeBike, group = "color map") %>% 
  addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
  # addProviderTiles(providers$Stamen.Toner, group = "white map") %>% 
  addScaleBar() %>% 
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
  addCircleMarkers(~lng, ~lat, radius = 1,
                   color = "firebrick", fillOpacity = 0.001) %>%
  addMarkers(~lng, ~lat, icon = greentaxi, label = ~as.character(paste("Number of Accidents:",Weekend_Top5$n))) %>%
  addLayersControl(
    baseGroups = c("Color map", "Light map"),
    options = layersControlOptions(collapsed = FALSE)
  )

6 Step 5 Modelling

To recommend a pick up spot, I leverage the power of unsupervised learning by using a simple Kmeans model to group the pick up spots into 50 groups. Each of the pick up locations

6.1 Recommend to Find Pick Up Spot

6.1.1 Preprocess the data

According to the dictionary, there are two types of trip - street-hail and dispatch. For this question, we should only focus on the street-hail and exclude the dispatches.

data_coord <- data %>% select(LONGITUDE, LATITUDE)
data1 <- data

I used kmeans model to classify the coordinates into 50 groups.

set.seed(0)
data_kmeans <- data_coord %>% kmeans(50,nstart=20)

data1$cluster <- data_kmeans$cluster

pal <- colorNumeric(
  palette = "Blues",
  domain = data$cluster)

I sampled 10,000 observations and put them on the map.

So far, I answered the first three questions. To answer the last question, I would leverage the power of shiny app and make an interactive graph with the input option for longitude and latitude. Then, I would use the kmeans model to predict which cluster the input location would be in and focus on the pickup points within that cluster. Final, I would pick top 20 pick up points to recommend and the coordinate of the closest pick up spot among the Top 20.

Please found these result from the Shiny app below.

set.seed(0)
data1 %>% sample_n(size=10000) %>% 
  leaflet() %>% 
  addProviderTiles(providers$HikeBike.HikeBike, group = "color map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
  # addProviderTiles(providers$Stamen.Toner, group = "white map") %>% 
  addScaleBar() %>%
  addCircleMarkers(~LONGITUDE, ~LATITUDE, radius = 1,
                   color = ~pal(cluster), fillOpacity = 0.001) %>%
  addLayersControl(
    baseGroups = c("Color map", "Light map"),
    options = layersControlOptions(collapsed = FALSE)
  )

7 Shiny App

7.1 The final mission to Answer Question 4

I set up the input options for longitude and latitude with sliders. Once that data is input, the program would make a prediction, for which cluster it belongs to, based on the input and kmeans model. Then, it would give 20 recommended pick up spots within the cluster as well as the closest pick up spot among the Top 20.

Please be awared that the graph below is just the screenshot of the actual interactive graph, since Shiny app is not available on Kaggle at the moment.